Temporal Extension of Scale Pyramid and Spatial Pyramid Matching for Action Recognition
Historically, researchers in the field have spent a great deal of effort to
create image representations that have scale invariance and retain spatial
location information. This paper proposes to encode equivalent temporal
characteristics in video representations for action recognition. To achieve
temporal scale invariance, we develop a method called temporal scale pyramid
(TSP). To encode temporal information, we present and compare two methods
called temporal extension descriptor (TED) and temporal division pyramid
(TDP). Our purpose is to suggest solutions for matching complex actions that have
large variation in velocity and appearance, which is missing from most current
action representations. Experimental results on four benchmark datasets,
UCF50, HMDB51, Hollywood2 and Olympic Sports, support our approach, which
significantly outperforms state-of-the-art methods. Most notably, we achieve
65.0% mean accuracy and 68.2% mean average precision on the challenging HMDB51
and Hollywood2 datasets, absolute improvements over the state of the art of
7.8% and 3.9%, respectively.
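As an illustration of the temporal division idea described above, the sketch below pools per-frame descriptors over successively finer temporal segments and concatenates the results. The function name, the choice of mean pooling and the number of levels are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def temporal_division_pyramid(frame_features, levels=3):
    """Pool per-frame descriptors over a pyramid of temporal divisions.

    At level l the video is split into 2**l equal segments; each segment
    is mean-pooled and all pooled vectors are concatenated, so coarse
    levels describe the whole clip while fine levels keep temporal order.
    (Illustrative sketch; the paper's actual pooling/encoding may differ.)
    """
    pooled = []
    for level in range(levels):
        n_segments = 2 ** level
        # Split the time axis into n_segments (nearly) equal chunks.
        for chunk in np.array_split(frame_features, n_segments, axis=0):
            pooled.append(chunk.mean(axis=0))
    return np.concatenate(pooled)  # length D * (1 + 2 + 4 + ...)
```

Because every level is concatenated, the final vector stays comparable across videos of different lengths while still encoding where in the clip each motion occurred.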
Beyond Gaussian Pyramid: Multi-skip Feature Stacking for Action Recognition
Most state-of-the-art action feature extractors involve differential
operators, which act as high-pass filters and tend to attenuate low-frequency
action information. This attenuation introduces bias to the resulting features
and generates ill-conditioned feature matrices. The Gaussian Pyramid has been
used as a feature enhancing technique that encodes scale-invariant
characteristics into the feature space in an attempt to deal with this
attenuation. However, at the core of the Gaussian Pyramid is a convolutional
smoothing operation, which makes it incapable of generating new features at
coarse scales. In order to address this problem, we propose a novel feature
enhancing technique called Multi-skIp Feature Stacking (MIFS), which stacks
features extracted using a family of differential filters parameterized with
multiple time skips and encodes shift-invariance into the frequency space. MIFS
compensates for information lost from using differential operators by
recapturing information at coarse scales. This recaptured information allows us
to match actions at different speeds and ranges of motion. We prove that MIFS
enhances the learnability of differential-based features exponentially. The
resulting feature matrices from MIFS have much smaller condition numbers and
variances than those from conventional methods. Experimental results show
significantly improved performance on challenging action recognition and event
detection tasks. Specifically, our method exceeds the state of the art on the
Hollywood2, UCF101 and UCF50 datasets and is comparable to the state of the art
on the HMDB51 and Olympic Sports datasets. MIFS can also be used as a speedup
strategy for feature extraction, with minimal or no accuracy cost.
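The multi-skip idea above can be sketched minimally: differencing consecutive frames is a high-pass operation, while differencing at larger time skips recaptures slower, low-frequency motion. MIFS actually re-runs trajectory-based extractors at each skip; in this hedged sketch a plain frame difference stands in for the differential operator, and the skip values are arbitrary.

```python
import numpy as np

def multi_skip_stack(frames, skips=(1, 2, 4)):
    """Stack temporal-difference features computed at several frame skips.

    A skip of 1 captures fast motion only; larger skips respond to
    slower motion, so stacking the results covers actions performed at
    different speeds. (Simplified stand-in for the MIFS pipeline.)
    """
    stacked = []
    for s in skips:
        diff = frames[s:] - frames[:-s]  # temporal difference with skip s
        stacked.append(diff)
    return stacked  # one feature matrix per skip, to be pooled/encoded
```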
Strategies for Searching Video Content with Text Queries or Video Examples
The large number of user-generated videos uploaded to the Internet every day
has led to many commercial video search engines, which rely mainly on text
metadata for search. However, metadata is often lacking for user-generated
videos, making them unsearchable by current search engines. Content-based
video retrieval (CBVR) tackles this metadata-scarcity problem by directly
analyzing the visual and audio streams of each video. CBVR
encompasses multiple research topics, including low-level feature design,
feature fusion, semantic detector training and video search/reranking. We
present novel strategies in these topics to enhance CBVR in both accuracy and
speed under different query inputs, including pure textual queries and query by
video examples. Our proposed strategies were incorporated into our submission
to the TRECVID 2014 Multimedia Event Detection evaluation, where our system
outperformed other submissions on both text queries and video-example queries,
demonstrating the effectiveness of our proposed approaches.
A unified framework with a benchmark dataset for surveillance event detection
As an important branch of multimedia content analysis, Surveillance Event Detection (SED) remains a challenging task due to the high abstraction and complexity of surveillance scenes, including occlusions, cluttered backgrounds and viewpoint changes. To address the problem, we propose a unified SED framework that divides events into two categories: short-term events and long-duration events. The former can be represented as snapshots of static key-poses and embody inner-dependencies, while the latter contain complex interactions between pedestrians and show obvious inter-dependencies and temporal context. For short-term events, a novel cascade Convolutional Neural Network (CNN), HsNet, is first constructed to detect pedestrians, and the corresponding events are then classified. For long-duration events, Dense Trajectory (DT) and Improved Dense Trajectory (IDT) features are first applied to explore the temporal structure of the events; subsequently, Fisher Vector (FV) coding is adopted to encode the raw features and linear SVM classifiers are learned for prediction. Finally, a heuristic fusion scheme combines the results. In addition, a new large-scale pedestrian dataset, named SED-PD, is built for evaluation. Comprehensive experiments on the TRECVID SED test datasets demonstrate the effectiveness of the proposed framework.
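The Fisher Vector coding step mentioned above can be sketched as follows. The GMM parameters are assumed to be fitted offline (e.g. with scikit-learn's `GaussianMixture` on a sample of DT/IDT descriptors); the function name and the power/L2 normalization follow common FV practice rather than this paper's exact pipeline.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Encode local descriptors X (N x D) with a K-component diagonal GMM.

    Concatenates the normalized gradients of the log-likelihood w.r.t.
    the GMM means and standard deviations, the usual encoding fed to a
    linear SVM. `weights` is (K,), `means` and `sigmas` are (K, D).
    """
    N, D = X.shape
    K = len(weights)
    # Soft-assign each descriptor to the K Gaussians (posteriors gamma).
    log_p = np.empty((N, K))
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        log_p[:, k] = (np.log(weights[k])
                       - 0.5 * np.sum(diff ** 2, axis=1)
                       - np.sum(np.log(sigmas[k])))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    fv = []
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        g_mu = gamma[:, k:k + 1] * diff             # gradient w.r.t. mean
        g_sig = gamma[:, k:k + 1] * (diff ** 2 - 1)  # w.r.t. std. dev.
        fv.append(g_mu.sum(axis=0) / (N * np.sqrt(weights[k])))
        fv.append(g_sig.sum(axis=0) / (N * np.sqrt(2 * weights[k])))
    fv = np.concatenate(fv)            # length 2 * K * D
    # Power- and L2-normalization, standard for improved Fisher Vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

The resulting fixed-length vector is what the abstract's linear SVM classifiers are trained on, one video (or event window) per vector.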
Informedia E-Lamp @ TRECVID 2013: Multimedia Event Detection and Recounting (MED and MER)
We report on our system used in the TRECVID 2013 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. For MED, the system consists of four main steps: feature extraction, feature representation, detector training and fusion. In the feature extraction step, we extract more than 10 low-level, high-level and text features. These features are then represented in three different ways: spatial bag-of-words, Gaussian Mixture Model (GMM) super vectors and Fisher Vectors. In detector training and fusion, two classifiers and a weighted double fusion method are employed. The official evaluation results show that our full MED systems achieve the best scores on Ad-Hoc EK10 and EK0, and our audio systems achieve the best scores on EK100 and EK10 for both the Pre-specified and Ad-Hoc tasks. Our MER system utilizes a subset of the features and detection results from the MED system, from which the recounting is generated.
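The weighted fusion step mentioned above can be sketched minimally as a weighted average of per-feature classifier scores. "Double fusion" combines both early-fused and single-feature outputs; this hedged sketch shows only the weighted late-fusion part, with z-normalization and uniform example weights standing in for the weights the actual system learns on validation data.

```python
import numpy as np

def weighted_late_fusion(score_lists, fusion_weights):
    """Fuse detection scores from several classifiers with given weights.

    Each entry of `score_lists` holds one classifier's scores for the
    same set of test videos. Scores are z-normalized first so that
    classifiers with different output scales contribute comparably.
    (Illustrative sketch; the MED system learns its weights offline.)
    """
    fused = np.zeros(len(score_lists[0]), dtype=float)
    for scores, w in zip(score_lists, fusion_weights):
        s = np.asarray(scores, dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-12)  # z-normalize per classifier
        fused += w * s
    return fused / sum(fusion_weights)
```

Ranking test videos by the fused score then gives the final detection list for each event.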